This report explores a dataset containing loan attributes and respective borrower details for approximately 114,000 loans
Below are the dimensions and the variables for the data set
## [1] 113937 81
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
The dataset has 81 variables. I have selected 15 variables that I believe are unrelated and will provide distinct variables for analysis. After a quick glance at the dataset, I realise that the columns “CreditGrade” and “ProsperRating (Alpha)” both provide same information, with the only difference that “CreditGrade” is filled for loans listed pre-2009 and the other for post - 2009. I have checked that there are no loans with both fields filled but there are loans with no value in both fields.
## [1] "CreditGrade levels:"
## [1] "" "A" "AA" "B" "C" "D" "E" "HR" "NC"
## [1] "ProsperRating (Alpha) levels: "
## [1] "" "A" "AA" "B" "C" "D" "E" "HR"
## [1] "CreditGrade variable information:"
##
## Cancelled Chargedoff Completed
## 5 6650 18288
## Current Defaulted FinalPaymentInProgress
## 0 4010 0
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 0 0 0
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 0 0 0
## [1] 28953 81
## [1] "ProsperRating(Alpha) variable information:"
##
## Cancelled Chargedoff Completed
## 0 5336 19664
## Current Defaulted FinalPaymentInProgress
## 56576 1005 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
## [1] 84853 81
## [1] "Dimensions of data set with data in neither variable:"
## [1] 131 81
## [1] "Dimensions of data set with data in both variables:"
## [1] 0 81
I wish to include the loan credit rating in the analysis and so I have created a new column called “LoanRating” which is a combination of the values from CreditGrade and ProsperRating..Alpha. The 131 blank values are converted to NA and the variable is converted to a factor variable. This new column is then included in my list of shortlisted columns for further analysis. I also notice that there are columns for the date and quarter of loan origination. I will create a new column for the year of loan origination as that will be more useful to see if the relationships have changed over time.
## [1] "Table for new variable Year:"
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 22 5906 11460 11552 2047 5652 11228 19553 34345 12172
I have the necessary variables now to create a subset of the data set.
## [1] "Dimensions of the new data set: "
## [1] 113937 17
## [1] "Structure of the new data set: "
## 'data.frame': 113937 obs. of 17 variables:
## $ LoanRating : Factor w/ 8 levels "AA","A","B","C",..: 4 2 7 2 5 3 6 4 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ RevolvingCreditBalance: num 0 3989 NA 1444 6193 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ Year : chr "2007" "2014" "2007" "2012" ...
## $ Quarter : chr "Q3" "Q1" "Q1" "Q4" ...
## [1] "Summary of the new data set: "
## LoanRating Term BorrowerAPR EstimatedLoss
## C :23994 Min. :12.00 Min. :0.00653 Min. :0.005
## B :19970 1st Qu.:36.00 1st Qu.:0.15629 1st Qu.:0.042
## D :19427 Median :36.00 Median :0.20976 Median :0.072
## A :17866 Mean :40.83 Mean :0.21883 Mean :0.080
## E :13084 3rd Qu.:36.00 3rd Qu.:0.28381 3rd Qu.:0.112
## (Other):19465 Max. :60.00 Max. :0.51229 Max. :0.366
## NA's : 131 NA's :25 NA's :29084
## ProsperScore BorrowerState Occupation
## Min. : 1.00 CA :14717 Other :28617
## 1st Qu.: 4.00 TX : 6842 Professional :13628
## Median : 6.00 NY : 6729 Computer Programmer : 4478
## Mean : 5.95 FL : 6720 Executive : 4311
## 3rd Qu.: 8.00 IL : 5921 Teacher : 3759
## Max. :11.00 : 5515 Administrative Assistant: 3688
## NA's :29084 (Other):67493 (Other) :55456
## CreditScoreRangeLower CurrentCreditLines TotalInquiries
## Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:660.0 1st Qu.: 7.00 1st Qu.: 2.000
## Median :680.0 Median :10.00 Median : 4.000
## Mean :685.6 Mean :10.32 Mean : 5.584
## 3rd Qu.:720.0 3rd Qu.:13.00 3rd Qu.: 7.000
## Max. :880.0 Max. :59.00 Max. :379.000
## NA's :591 NA's :7604 NA's :1159
## RevolvingCreditBalance DebtToIncomeRatio IncomeRange
## Min. : 0 Min. : 0.000 $25,000-49,999:32192
## 1st Qu.: 3121 1st Qu.: 0.140 $50,000-74,999:31050
## Median : 8549 Median : 0.220 $100,000+ :17337
## Mean : 17599 Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 19521 3rd Qu.: 0.320 Not displayed : 7741
## Max. :1435667 Max. :10.010 $1-24,999 : 7274
## NA's :7604 NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanOriginalAmount Year
## Min. : 0 Min. : 1000 Length:113937
## 1st Qu.: 3200 1st Qu.: 4000 Class :character
## Median : 4667 Median : 6500 Mode :character
## Mean : 5608 Mean : 8337
## 3rd Qu.: 6825 3rd Qu.:12000
## Max. :1750003 Max. :35000
##
## Quarter
## Length:113937
## Class :character
## Mode :character
##
##
##
##
I notice that the variable Term has the same value for Median, 1st Quartile and 3rd Quartile. I wonder if there are only a handful of unique values and whether it is better to convert it to a factor variable. On analysis I find that this has only 3 unique values and so it does make sense to convert to a factor variable and rerun the structure and summary analysis
## [1] "Unique values of Term variable: "
## [1] 36 60 12
## [1] "Structure of the data set after creating factor variables: "
## 'data.frame': 113937 obs. of 17 variables:
## $ LoanRating : Factor w/ 8 levels "AA","A","B","C",..: 4 2 7 2 5 3 6 4 1 1 ...
## $ Term : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ ProsperScore : Factor w/ 11 levels "1","2","3","4",..: NA 7 NA 9 4 10 2 4 9 11 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ RevolvingCreditBalance: num 0 3989 NA 1444 6193 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ Year : chr "2007" "2014" "2007" "2012" ...
## $ Quarter : chr "Q3" "Q1" "Q1" "Q4" ...
## [1] "Summary of the data set: "
## LoanRating Term BorrowerAPR EstimatedLoss
## C :23994 12: 1614 Min. :0.00653 Min. :0.005
## B :19970 36:87778 1st Qu.:0.15629 1st Qu.:0.042
## D :19427 60:24545 Median :0.20976 Median :0.072
## A :17866 Mean :0.21883 Mean :0.080
## E :13084 3rd Qu.:0.28381 3rd Qu.:0.112
## (Other):19465 Max. :0.51229 Max. :0.366
## NA's : 131 NA's :25 NA's :29084
## ProsperScore BorrowerState Occupation
## 4 :12595 CA :14717 Other :28617
## 6 :12278 TX : 6842 Professional :13628
## 8 :12053 NY : 6729 Computer Programmer : 4478
## 7 :10597 FL : 6720 Executive : 4311
## 5 : 9813 IL : 5921 Teacher : 3759
## (Other):27517 : 5515 Administrative Assistant: 3688
## NA's :29084 (Other):67493 (Other) :55456
## CreditScoreRangeLower CurrentCreditLines TotalInquiries
## Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:660.0 1st Qu.: 7.00 1st Qu.: 2.000
## Median :680.0 Median :10.00 Median : 4.000
## Mean :685.6 Mean :10.32 Mean : 5.584
## 3rd Qu.:720.0 3rd Qu.:13.00 3rd Qu.: 7.000
## Max. :880.0 Max. :59.00 Max. :379.000
## NA's :591 NA's :7604 NA's :1159
## RevolvingCreditBalance DebtToIncomeRatio IncomeRange
## Min. : 0 Min. : 0.000 $25,000-49,999:32192
## 1st Qu.: 3121 1st Qu.: 0.140 $50,000-74,999:31050
## Median : 8549 Median : 0.220 $100,000+ :17337
## Mean : 17599 Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 19521 3rd Qu.: 0.320 Not displayed : 7741
## Max. :1435667 Max. :10.010 $1-24,999 : 7274
## NA's :7604 NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanOriginalAmount Year
## Min. : 0 Min. : 1000 Length:113937
## 1st Qu.: 3200 1st Qu.: 4000 Class :character
## Median : 4667 Median : 6500 Mode :character
## Mean : 5608 Mean : 8337
## 3rd Qu.: 6825 3rd Qu.:12000
## Max. :1750003 Max. :35000
##
## Quarter
## Length:113937
## Class :character
## Mode :character
##
##
##
##
The dataset now contains 17 variables with almost 114,000 observations. Next I will run series of univariate plots on these variables
The loan rating of C has the maximum number of borrowers. The graph is a pyramid type structure with AA (the highest rating) has the least number of borrowers, as expected.
Most of the loans have been issued for 36 months. There are hardly any loans for 12 months and none for terms of 2yrs or 4 yrs. I wonder why this is so but potentially this could be due to a non-random sampling of data or due to commercial reasons. We are unable to investigate that in this exercise.
Most of the loans seem to have originated after 2009. There is a sudden dip in the originated loans after 2008. I assume this may be due to the financial crisis but since then there has been a steady growth in loan origination. I see reduction in 2014 which can happen if the sample data does not cover whole of 2014.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
The BorrowerAPR represents the interest rate being charged to the borrower. The data is in decimals but since this is a percentage value, I have multiplied by 100 to make it easier to read and interpret. Most of the borrowers are being charged the rate between 15% and 30% and the maximum rate being charged is around 51% which is roughly 2.5 times the median. On reducing the binwidth, I see a spike in number of borrowers being charged between 35% and 37%. Visually it looks as if there are 3 pyramid structures - 6% to around 12%, 12% to 33% and 33% to 51%. I wonder if this is due to different terms of the loans as the area under each pyramid seems to be in line with the count values from histogram of the Term variable.
The above are the the BorrowerAPR histograms split by Term. The BorrowerAPR is spread across a wide range for each term and so this confirms that if there are 3 distinct pyramids, then they are not due to the 3 terms.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.042 0.072 0.080 0.112 0.366 29084
The EstimatedLoss for most loans seems to be within 0.112 but there are some outliers with the maximum value being 0.366. There seem to be a lot of NA which is surprising as we would expect each loan to have an estimate for expected loss value. To identify the reason, I have run a test of the status of the loans that have the EstimatedLoss as NA. I run this on the original loans dataset and not the subset of 17 variables as LoanStatus is not included in the subset.
## [1] "LoanStatus table where EstimatedLoss is NA"
##
## Cancelled Chargedoff Completed
## 5 6656 18410
## Current Defaulted FinalPaymentInProgress
## 0 4013 0
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 0 0 0
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 0 0 0
The results show that all such loans are either Cancelled or ChargedOff or Completed or Defaulted. This seems that the loans are no longer active and so the expected loss from them in future is not defined.
## 1 2 3 4 5 6 7 8 9 10 11 NA's
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456 29084
Most of the loans have the ProsperScore between 4 and 8 and the structure is similar to the LoanRating histogram.
## AK AL AR AZ CA CO CT DC DE FL GA
## 5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008
## HI IA ID IL IN KS KY LA MA MD ME MI
## 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 2318 2615 787 330 3084 52 674 551 3097 472 1090 6729
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207
## WA WI WV WY
## 3048 1842 391 150
Roughly 15,000 of the borrowers are from the State of CA which is more than 10% of the dataset. At the same time, there are several states such as ME, AK, IA, ND and WY that have less than or equal to 200 borrowers each. I wonder why the distribution is so skewed and whether it is linked to the population distribution across the States or is there some other factor(s).
## Accountant/CPA
## 3588 3233
## Administrative Assistant Analyst
## 3688 3602
## Architect Attorney
## 213 1046
## Biologist Bus Driver
## 125 316
## Car Dealer Chemist
## 180 145
## Civil Service Clergy
## 1457 196
## Clerical Computer Programmer
## 3164 4478
## Construction Dentist
## 1790 68
## Doctor Engineer - Chemical
## 494 225
## Engineer - Electrical Engineer - Mechanical
## 1125 1406
## Executive Fireman
## 4311 422
## Flight Attendant Food Service
## 123 1123
## Food Service Management Homemaker
## 1239 120
## Investor Judge
## 214 22
## Laborer Landscaping
## 1595 236
## Medical Technician Military Enlisted
## 1117 1272
## Military Officer Nurse (LPN)
## 346 492
## Nurse (RN) Nurse's Aide
## 2489 491
## Other Pharmacist
## 28617 257
## Pilot - Private/Commercial Police Officer/Correction Officer
## 199 1578
## Postal Service Principal
## 627 312
## Professional Professor
## 13628 557
## Psychologist Realtor
## 145 543
## Religious Retail Management
## 124 2602
## Sales - Commission Sales - Retail
## 3446 2797
## Scientist Skilled Labor
## 372 2746
## Social Worker Student - College Freshman
## 741 41
## Student - College Graduate Student Student - College Junior
## 245 112
## Student - College Senior Student - College Sophomore
## 188 69
## Student - Community College Student - Technical School
## 28 16
## Teacher Teacher's Aide
## 3759 276
## Tradesman - Carpenter Tradesman - Electrician
## 120 477
## Tradesman - Mechanic Tradesman - Plumber
## 951 102
## Truck Driver Waiter/Waitress
## 1675 436
The borrowers seem to be broadly spread across all occupations with couple of exceptions. The highest number is from “Other” and next from “Professional”. These don’t give much clarity into the occupation and so I wonder whether this field is useful for this analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## [1] "Number of rows with CreditScoreRangeLower value of 0.0"
## [1] 133 17
Re-running the histogram to ignore the values of 0.0
Most of the values for CreditScoreRangeLower are between 660 and 720. The graph has a bigger left tail compared to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.32 13.00 59.00 7604
Most of the borrowers have between 7 and 13 current credit lines. By definition, this is a minimum of 0. Though, some borrowers with 0 current credit lines have received loans, we should bear in mind that this data only covers people that had the loans accepted, and not the people who applied for loans. There is a possibility that only a small proportion of such people were accepted due to lack of prior credit history.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 4.000 5.584 7.000 379.000 1159
## [1] "Dimension of data set with TotalInquiries > 50: "
## [1] 202 17
Most people have less than or equal to 7 total inquiries. This dataset has a very long tail, with a maximum value of 379. There are 202 values above 50, out of a total dataset of around 114,000 values which is quite small. I wonder if this is due to the loan origination criteria set by Prosper or whether this is a reflection of the population in general.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3121 8549 17599 19521 1435667 7604
## [1] "Dimension of the data set with RevolvingCreditBalance > 50000: "
## [1] 7318 17
Most of the borrowers have RevolvingCreditBalance of less than 20,000. This value varies greatly across the borrowers with the maximum being around 1,435,000. In this dataset, over 7300 borrowers have this value above 50000. I wonder if this value in itself can directly influence the pricing of the loans as individuals with different income levels can have a different level of outstanding credit balance but the same proportional value, and by intuition, I would guess that the pricing of the loan for an amount in same proportion should be very similar.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
## [1] "Dimension of the data set with DebtToIncomeRatio > 1: "
## [1] 799 17
Most of the borrowers have DebtToIncomeRatio of less that 0.32. There are roughly 800 borrowers with the ratio of more than 1.0 and the maximum value is around 10.0. Similar to the charts above, this graph has a smooth peak and big tail.
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
More than half of the borrowers, roughly 63,000, have income level between $25,000 and $75,000. Another 35,000 borrowers have income level above $75,000. The dataset doesn’t tell us the maximum income level across all borrowers. This may make the analysis bit difficult for borrowers with high income if the actual variation in income above $100,000 is very large.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
## [1] "Dimension of the data set with StatedMonthlyIncome > 10000: "
## [1] 9780 17
Most borrowers have a monthly income of less than $7,000 and the median is around 4,650. The maximum monthly income is $1,750,000 which is much bigger than the median. Roughly 9800 borrowers have monthly income above $10,000 which equates to $120,000 per annum. Comparing this to the IncomeRange data, I can imply that roughly 7,500 borrowers have an annual income between $100,000 and $120,000 and remaining 9,800 borrowers have annual income above $120,000. This shows that the distribution of the income in the $100,000+ bracket of the IncomeRange is very wide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
## [1] "Dimension of the data set with LoanOriginalAmount > 25000: "
## [1] 680 17
Most of the loans have an original amount of less than $12,000. The maximum loan amount is $35,000. The median is $6,500 and 680 loans have an amount above $25,000. I notice from the graph that most loan amounts seem to be rounded values and there are some distinct peaks at $3000, $5000, $10,000, $15000 and $20,000. I wonder if this is due to borrowers’ behaviour that they tend to go for rounded amounts to cover their financial requirements, instead of an exact amount to meet a specific requirement.
The dataset contains information on about 114,000 loans issued by Prosper and includes loan and borrower data. The original dataset has 81 variables and I have taken a subset of 17 variables for the purporse of this analysis. The variable LoanRating is an ordered factor variable with the following levels:
(worst) —————————> (best)
LoanRating: NC, HR, E, D, C, B, A, AA
Some observations about the dataset:
Most of the loans have a maturity of 36 months.
The median CreditScoreRangeLower is 680
Most borrowers have between 7 and 13 current credit lines.
Most borrowers have DebtToIncomeRatio of less than 0.32 and the maximum is 10
Most borrowers have monthly income of less than $7,000 and the median is around 4,650. The maximum monthly income is $1,750,000
There is a sudden dip in number of loan originations in 2008.
The main interesting feature of the dataset is the pricing of the loans and the assesment of credit rating (LoanRating). I would like to determine which features are best for predicting the BorrowerAPR of the loan. I think DebtToIncomeRatio, CreditScoreRangeLower and some other variables can be used to build a predictive model to price the Prosper loans.
In addition to DebtToIncomeRatio and CreditScoreRangeLower, I think current credit lines and loan amount may also contribute to the pricing. I am also interested in checking if there is any change in relations around 2009 when the loan originations suddenly dipped and have increased steadily since.
I have created a new variable LoanRating by combining CreditGrade and ProsperRating (Alpha). CreditGrade covers loans from period pre-2009 and ProsperRating (Alpha) covers post 2009 period and so it makes sense to combine them for our analysis. I have also created new variables Year and Quarter by splitting the LoanOriginationQuarter variable.
On plotting the chart for BorrowerAPR, I noticed 3 different pyramid structures. To check if this was driven by Term variable, with 3 values, I split the chart by Term but realised that all 3 charts have a similar chart pattern. The 3 peaks of the pyramids are not caused by Term. For some reason, there is a sudden rise in the number of loans with BorrowerAPR between 35 and 37. I also notice that there are sudden spikes in the histogram for EstimatedLoss. When I started the project, I thought that the loans with a higher expected chance of loss will have a higher borrower rate, but the histograms for these two do not look similar. The charts start to look similar when observed from left but there is sudden discontinuity in the distribution of EstimatedLoss for values above 0.10 with several peaks. I find this strange and unexpected. I have found a sudden dip in loan origination numbers in 2009, followed by a consistent increase. I am interested in understanding what impact this has on relationships amongst the variables.
I have created new variables LoanRating, Year and Quarter as mentioned above.
Based on the Univariate analysis, I will create a subset of the previous data set with 13 variables.
## [1] "New dataset dimensions and variable names: "
## [1] 113937 13
## [1] "BorrowerAPR" "EstimatedLoss"
## [3] "LoanRating" "Term"
## [5] "ProsperScore" "CreditScoreRangeLower"
## [7] "CurrentCreditLines" "TotalInquiries"
## [9] "DebtToIncomeRatio" "StatedMonthlyIncome"
## [11] "LoanOriginalAmount" "Year"
## [13] "Quarter"
I will first create a scatterplot matrix using these variables. I will exclude Year and Quarter from this matrix as they do not represent the terms of the loans
EstimatedLoss correlates with BorrowerAPR which makes sense as borrowers with a higher expected loss should be charged a higher borrowing rate. ProsperScore and CreditScoreRangeLower show moderate correlation with BorrowerAPR. LoanOriginalAmount also shows a moderate correlation with BorrowerAPR but it is negative which I find surprising as higher the loan amount for a borrower greater is the risk and so a higher BorrowerAPR should apply.
EstimatedLoss shows a strong relationship with LoanRating and I would like to see how the relationship looks between LoanRating and BorrowerAPR.
It seems that the CurrentCreditLines, TotalInquiries, StatedMonthlyIncome and DebtToIncomeRatio do not have correlation with the BorrowerAPR. The median of the BorrowerAPR for each Term looks similar but it seems that BorrowerAPR has a wider interquartile range as Term increases.
I want to take a closer look at the scatter plots involving BorrowerAPR and other variables.
As EstimatedLoss increases, BorrowerAPR increases too. We can see some vertical lines where the BorrowerAPR differs significantly for the same EstimatedLoss. The relationship looks non-linear as the plot seems to flatten out as EstimatedLoss increases, especially after EstimatedLoss of 0.15.
##
## Call:
## lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss), data = subset(loans.sub.2,
## !is.na(BorrowerAPR)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45226 -0.01117 -0.00189 0.01238 0.19705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.096346 0.000171 563.4 <2e-16 ***
## I(EstimatedLoss) 1.622699 0.001840 881.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02507 on 84851 degrees of freedom
## (29059 observations deleted due to missingness)
## Multiple R-squared: 0.9016, Adjusted R-squared: 0.9016
## F-statistic: 7.776e+05 on 1 and 84851 DF, p-value: < 2.2e-16
Despite the fact that the relationship doesn’t look linear, based on the R^2 value, EstimatedLoss still explains 90% of the variance in BorrowerAPR.
We can see a clear relationship between ProsperScore and the median of BorrowerAPR. As ProsperScore gets low, the median of BorrowerAPR goes up and this is consistent with the negative correlation we saw above.
In this density plot, we see that the peaks move from low borrow rate (for Score 9) to high borrow rate (for Score 1)
##
## Call:
## lm(formula = I(BorrowerAPR) ~ I(ProsperScore), data = subset(loans.sub.2,
## !is.na(BorrowerAPR)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20222 -0.04307 -0.00692 0.03908 0.27641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.340580 0.001852 183.86 <2e-16 ***
## I(ProsperScore)2 -0.032346 0.002005 -16.13 <2e-16 ***
## I(ProsperScore)3 -0.057305 0.001969 -29.11 <2e-16 ***
## I(ProsperScore)4 -0.081103 0.001924 -42.16 <2e-16 ***
## I(ProsperScore)5 -0.077726 0.001944 -39.99 <2e-16 ***
## I(ProsperScore)6 -0.101991 0.001926 -52.96 <2e-16 ***
## I(ProsperScore)7 -0.124963 0.001937 -64.51 <2e-16 ***
## I(ProsperScore)8 -0.162060 0.001927 -84.10 <2e-16 ***
## I(ProsperScore)9 -0.191603 0.001981 -96.73 <2e-16 ***
## I(ProsperScore)10 -0.225458 0.002037 -110.70 <2e-16 ***
## I(ProsperScore)11 -0.228745 0.002402 -95.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05834 on 84842 degrees of freedom
## (29059 observations deleted due to missingness)
## Multiple R-squared: 0.4671, Adjusted R-squared: 0.467
## F-statistic: 7437 on 10 and 84842 DF, p-value: < 2.2e-16
The relationship is not linear but based on R^2 value, ProsperScore explains 46.7% of the variation in BorrowerAPR.
CreditScoreRangeLower offers an interesting relationship. There is no visible relationship for CreditScoreRangeLower values below 600 as the median of BorrowerAPR shows only small variation. There is a clear relationship for values greater than 600 as the median reduces with increasing CreditScoreRangeLower. This difference in relation explains why CreditScoreRangeLower was moderately and negatively correlated. This relation across all the values is not linear.
We can see several vertical lines and no general relationship. There is a wide variation in BorrowerAPR across most LoanOriginalAmount values. The variation reduces at certain points and it seems that the mid point of those variations also reduces which may be the reason why we get moderate and negative correlation for LoanOriginalAmount.
##
## Call:
## lm(formula = I(BorrowerAPR) ~ I(LoanOriginalAmount), data = subset(loans.sub.2,
## !is.na(BorrowerAPR)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23447 -0.05708 0.00010 0.05883 0.26533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.535e-01 3.759e-04 674.3 <2e-16 ***
## I(LoanOriginalAmount) -4.154e-06 3.608e-08 -115.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07606 on 113910 degrees of freedom
## Multiple R-squared: 0.1043, Adjusted R-squared: 0.1042
## F-statistic: 1.326e+04 on 1 and 113910 DF, p-value: < 2.2e-16
The scatterplot did not show a clear relationship but based on R^2 value, it seems the LoanOriginalAmount is able to explain roughly 10% of the variation in BorrowerAPR rate using a linear model.
This sounds counter-intuitive to me as a bigger loan amount should not be reducing the borrowing rate. I realise that the data set includes only the data for the loans that were approved and issued and so this data has a particular bias as it doesn’t include loans where the BorrowerAPR was high and the borrower refused to take the loan at that rate. I wonder if this is distorting the actual relationship since the borrowers can either be sensitive to borrow rate or have more options than Prosper for high loan amounts. We know from above analysis that a high EstimatedLoss implies a higher BorrowerAPR and a high EstimatedLoss is associated with an individual unable to pay back. The scatterplot matrix shows that the DebtToIncomeRatio has a moderate correlation to EstimatedLoss but not to BorrowerAPR. However, according to variable definitions, DebtToIncomeRatio is the debt to income ratio of the borrower at the time the credit profile is pulled and not the resulting ratio after the loan has been provided. I will create a new variable to represent this new ratio and run tests with it. I will divide the loan amount with the stated monthly income annualised (multiplied by 12) and then add this value to DebtToIncomeRatio. To remove errors due to extremely low monthly income values such as $1 or less, I will take the DebtToIncomeRatio as the new ratio for the bottom 2% of the stated monthly incomes cases.
## [1] "Summary of the new debt to income ratio: "
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.240 0.357 0.418 0.500 10.831 8554
##
## Call:
## lm(formula = I(BorrowerAPR) ~ I(newratio), data = subset(loans.sub.2,
## !is.na(BorrowerAPR)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.227366 -0.062144 -0.009422 0.060353 0.297288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2147422 0.0003048 704.553 <2e-16 ***
## I(newratio) 0.0035818 0.0004336 8.261 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07949 on 105356 degrees of freedom
## (8554 observations deleted due to missingness)
## Multiple R-squared: 0.0006473, Adjusted R-squared: 0.0006378
## F-statistic: 68.24 on 1 and 105356 DF, p-value: < 2.2e-16
The scatterplot shows no relationship between the new debt to income ratio and there is certainly no linear relationship either, based on R^2 value. I don’t have access to the data where the loans were not issued, especially for high loan amounts in order to establish whether there is any bias or not with the given data set. For the moment, I will need to assume that the analysis is restricted to loans that were issued and care should be taken before extending to a new loan valuation.
I wish to run density plot for LoanOriginalAmount and will need a factor variable. I am going to split the LoanOriginationAmount into 5 categories (A for lowest and E to highest) based on the placement of the value in 5 equal quartiles.
## [1] "Table for the new variable LoanAmtFactor: "
##
## A B C D E
## 17815 27019 22807 22945 23351
The density plot shows that lower amounts of LoanOriginalAmount, symbolised by LoanAmtFactor of A, have most loans issued at high values of BorrowerAPR. For the highest loan amounts (Loan Factor E), the concentration of BorrowerAPR value is relatively low. As we go from A to E, the loans have a much more even distribution across BorrowerAPR.
The boxplot shows that the lower values of LoanAmtFactor have a higher median BorrowerAPR. The median reduces as LoanAmtFactor increases, that is, as the LoanOriginalAmount increases.This shows why the correlation for BorrowerAPR and LoanOriginalAmount was negative. This still does not explain why we see this behaviour and this may be due to the bias introduced by the sampling method of the loans.
We see a clear relation between LoanRating and BorrowerAPR as the BorrowerAPR median values increase with LoanRating, except for NC.
The density plot shows the relationship with BorrowerAPR more clearly. I notice clear separated peaks for LoanRating with peak at lowest BorrowerAPR for AA LoanRating and peaks moving to higher BorrowerAPR with increase in LoanRating.
##
## Call:
## lm(formula = I(BorrowerAPR) ~ I(LoanRating), data = subset(loans.sub.2,
## !is.na(BorrowerAPR)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31895 -0.01609 0.00239 0.02376 0.23531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0964053 0.0004059 237.51 <2e-16 ***
## I(LoanRating)A 0.0419086 0.0004965 84.41 <2e-16 ***
## I(LoanRating)B 0.0832982 0.0004878 170.78 <2e-16 ***
## I(LoanRating)C 0.1220308 0.0004750 256.90 <2e-16 ***
## I(LoanRating)D 0.1695741 0.0004898 346.18 <2e-16 ***
## I(LoanRating)E 0.2191068 0.0005257 416.75 <2e-16 ***
## I(LoanRating)HR 0.2311807 0.0005520 418.83 <2e-16 ***
## I(LoanRating)NC 0.1386092 0.0032556 42.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03822 on 113773 degrees of freedom
## (131 observations deleted due to missingness)
## Multiple R-squared: 0.7737, Adjusted R-squared: 0.7737
## F-statistic: 5.557e+04 on 7 and 113773 DF, p-value: < 2.2e-16
LoanRating explains 77% of the variation in BorrowerAPR values based on R^2 value.
I want to check the relationship between LoanRating and EstimatedLoss, based on the scatterplot matrix results.
## loans.sub.2$LoanRating: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.010 0.015 0.014 0.017 0.020 3509
## --------------------------------------------------------
## loans.sub.2$LoanRating: A
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.020 0.025 0.032 0.031 0.037 0.040 3315
## --------------------------------------------------------
## loans.sub.2$LoanRating: B
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.041 0.047 0.052 0.052 0.057 0.060 4389
## --------------------------------------------------------
## loans.sub.2$LoanRating: C
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.061 0.067 0.077 0.076 0.085 0.090 5649
## --------------------------------------------------------
## loans.sub.2$LoanRating: D
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.092 0.099 0.108 0.106 0.112 0.119 5153
## --------------------------------------------------------
## loans.sub.2$LoanRating: E
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.122 0.132 0.142 0.140 0.147 0.149 3289
## --------------------------------------------------------
## loans.sub.2$LoanRating: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.152 0.162 0.165 0.173 0.170 0.366 3508
## --------------------------------------------------------
## loans.sub.2$LoanRating: NC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 141
This is an interesting result. Each LoanRating value has non overlapping values of EstimatedLoss. Moreover, the maximum and minumum values for each rating are very close to the minimum value of higher rating and maximum value of lower rating respectively. This suggests that the LoanRating values are categorisation of EstimatedLoss.
## loans.sub.2$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04935 0.14716 0.22189 0.21622 0.29167 0.35843
## --------------------------------------------------------
## loans.sub.2$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.14857 0.20976 0.21943 0.29265 0.51229 25
## --------------------------------------------------------
## loans.sub.2$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07111 0.17184 0.20931 0.21684 0.25718 0.35838
The median of the BorrowerAPR for all the 3 terms is almost the same. The terms 12 and 36 have almost th same Inter Quartile Range while this is much smaller for Term of 60. The maximum value for Terms 12 and 60 is almost same as just under 0.36. This means that the big BorrowerAPR numbers with the third “pyramid” style peak are all from Term 36.
Given the high correlation between BorrowerAPR and EstimatedLoss, I want to see these plots with EstimateLoss too.
## loans.sub.2$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00490 0.03500 0.06240 0.07413 0.10800 0.14750 1
## --------------------------------------------------------
## loans.sub.2$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.040 0.077 0.084 0.119 0.366 29083
## --------------------------------------------------------
## loans.sub.2$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0049 0.0474 0.0650 0.0715 0.0899 0.1490
EstimatedLoss has a similar profile, except that its median for Term 36 is a bit higher than others.
## loans.sub.2$Year: 2005
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 22
## --------------------------------------------------------
## loans.sub.2$Year: 2006
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.14709 0.19884 0.20061 0.25757 0.51229 3
## --------------------------------------------------------
## loans.sub.2$Year: 2007
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00653 0.13501 0.17722 0.18593 0.23662 0.37036
## --------------------------------------------------------
## loans.sub.2$Year: 2008
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01315 0.13625 0.17791 0.20363 0.26214 0.41355
## --------------------------------------------------------
## loans.sub.2$Year: 2009
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05636 0.12511 0.21122 0.21594 0.28777 0.39951
## --------------------------------------------------------
## loans.sub.2$Year: 2010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04583 0.13068 0.24999 0.23899 0.33097 0.42395
## --------------------------------------------------------
## loans.sub.2$Year: 2011
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05927 0.19108 0.29486 0.26230 0.34731 0.38723
## --------------------------------------------------------
## loans.sub.2$Year: 2012
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06587 0.18637 0.25781 0.25315 0.33553 0.35840
## --------------------------------------------------------
## loans.sub.2$Year: 2013
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06106 0.16732 0.20933 0.21395 0.25781 0.35797
## --------------------------------------------------------
## loans.sub.2$Year: 2014
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06726 0.14206 0.17710 0.18272 0.21648 0.34630
The median BorrowerAPR increased sharply after 2008 till 2011 and then it has consistently reduced till the last year of 2014 in our sample. Along with the median increase, the IQR also increased between 2008 and 2011, after which it collapsed suddenly. The IQR has been the lowest in 2013 and 2014. There looks to be a positive relation between median and IQR, split by year.
Earlier in the univariate plots, I had found several peaks in the BorrowerAPR histogram. The above plot shows that some of the peaks are prevalent in a particular year. For example, the peak at BorrowerAPR value above 35 is in year 2012. In 2013, we see a more balanced distribution. I wonder if this behaviour is a result of changes in business and customer behaviour or due to some changes in relationships between the variables.
The BorrowerAPR rate correlates highly with the EstimatedLoss values, ProsperScore and LoanRating.
The Borrower’s interest rate goes up as the estimated loss value of the loan goes up. In the scatterplot of BorrowerAPR vs EstimatedLoss, there are several vertical bands representing data points where the BorrowerAPR rate varies with similar EstimatedLoss. Using a linear model, I find that EstimatedLoss explains roughly 90% of the variation in BorrowerAPR. The Borrower’s interest rate goes up as the quality of loan reduces, represented by Loan Rating. Based on a linear model, Loan Rating explains 77% of the variation in BorrowerAPR. ProsperScore has a negative relation with BorrowerAPR as the Borrower’s interest rate goes up with a lower ProsperScore. Through a linear model, I find that ProsperScore explains 46.7% of the variation in BorrowerAPR.
There are 2 more features, CreditScoreRangeLower and LoanOriginalAmount that also have a relationship with BorrowerAPR that isn’t so strong. As the CreditScoreRangeLower value goes high, the median of BorrowerAPR comes down but this relationship only holds true once the CreditScoreRangeLower value is above 600. The LoanOriginalAmount has a negative correlation with BorrowerAPR and explains 10% of the variation in the Borrower’s interest rate. However, this result does not look intuitive to me and so I doubt that there is a bias introduced in the results due to the given dataset.
EstimatedLoss has a strong negative correlation with ProsperScore, CreditScoreRangeLower and LoanOriginalAmount. In addition, ProsperScore has a medium positive correlation with CreditScoreRangeLower. TotalInquires has a medium negative correlation with ProsperScore and CreditScoreRangeLower. The CurrentCreditLines has a medium positive correlation with CreditScoreRangeLower. RevolvingCreditBalance has a positive correlation with CurrentCreditLines. I have deduced that the LoanRating values are categorisation of the EstimatedLoss values and hence, should not be treated as a separate variable. BorrowerAPR values have a noticeably different distribution across the years. The reason behind this is not clear and could be dependent on customer behaviour or other factors that are outside the scope of this analysis. It is also possible that the relationships identified above have changed over time.
The strongest relationship is between BorrowerAPR and EstimatedLoss. As per the linear model, based on R^2 values, EstimatedLoss explains 90% of the variation in the BorrowerAPR values. I found a strong relation between BorrowerAPR and LoanRating as well, but having also found that LoanRating is a categorisation of EstimatedLoss, I will not treat it as an independent variable.
It is clear that EstimatedLoss is a key driver to the value of BorrowerAPR. In previous section, we observed in the scatter plot for BorrowerAPR vs EstimatedLoss that there were several vertical lines representing multiple values of BorrowerAPR for the same value of EstimatedLoss. I am interested in finding what drives these variations in values. In my next round of plots, I want to see how other factors that show relationship with BorrowerAPR (shown in Bivariate Plots analysis) relate to the ratio of BorrowerAPR and EstimatedLoss. I will create a new variable “APR_by_Loss” as the ratio of BorrowerAPR and EstimatedLoss.
## [1] "Summary of new variable APR_by_Loss"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.591 2.558 3.040 3.429 3.863 35.613 29084
We see that there is a reasonable divergence in the values for APR_by_Loss. It is this divergence that we want to explain using other variables.
## loans.sub.2$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.140 2.677 3.157 3.554 3.824 13.447 1
## --------------------------------------------------------
## loans.sub.2$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.591 2.470 2.942 3.428 3.934 35.613 29083
## --------------------------------------------------------
## loans.sub.2$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.080 2.676 3.091 3.424 3.788 24.492
We see that the vertical lines in the BorrowerAPR vs EstimatedLoss are not explained by Term as most seem to occur for a specific value of Term. We do notice that in case of Term of 60, the plot seems to diverge a bit more around the plots for Terms 12 and 36, ignoring the vertical lines. This tells me that there may be some variable causing the divergence.
## loans.sub.2$LoanAmtFactor: A
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.594 2.308 2.726 3.312 3.423 25.592 8861
## --------------------------------------------------------
## loans.sub.2$LoanAmtFactor: B
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.594 2.228 2.454 2.853 2.996 20.177 6295
## --------------------------------------------------------
## loans.sub.2$LoanAmtFactor: C
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.591 2.668 3.065 3.513 3.934 35.613 7259
## --------------------------------------------------------
## loans.sub.2$LoanAmtFactor: D
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.645 2.729 3.253 3.595 3.999 24.492 3687
## --------------------------------------------------------
## loans.sub.2$LoanAmtFactor: E
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.085 2.952 3.529 3.847 4.257 16.990 2982
The median of the APR_by_Loss increases as the LoanAmtFactor increases which implies a higher value of borrow rate for same Estimated Loss. This is the relationship I was expecting, instead of the negative correlation between the BorrowerAPR and LoanOriginalAmount. As a result, I believe that LoanAmtFactor is a better variable to study the relationship with BorrowerAPR.
The plots show that the vertical lines appear mainly for the lower values of the Loan amounts. I also notice that if I were to create a linear model for BorrowerAPR using EstimatedLoss then I will get a different slope for each value of LoanAmtFactor. The slope of the line increases as the loan amount increases. Based on the above analysis, I think LoanAmtFactor does have a non-random relationship with BorrowerAPR and should be considered when creating a model to predict BorrowerAPR.
## loans.sub.2$ProsperScore: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5911 2.1371 2.2801 2.1811 2.4052 4.2704
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8202 2.2062 2.3664 2.4134 2.5530 4.8544
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8193 2.2519 2.5039 2.5726 2.7871 5.4795
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.139 2.390 2.676 2.769 3.050 5.702
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.067 2.425 2.773 2.900 3.323 5.726
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.347 2.640 2.952 3.167 3.638 12.909
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.406 2.827 3.125 3.401 3.891 13.243
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.185 3.166 3.727 3.987 4.615 10.589
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.234 3.595 4.411 4.592 5.379 11.813
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.618 4.225 5.210 6.139 6.714 35.613
## --------------------------------------------------------
## loans.sub.2$ProsperScore: 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.529 4.219 4.852 5.520 6.389 9.089
I realise that loans with lower ProsperScore have less occurence of low BorrowerAPR. These are also the loans with some of the highest EstimatedLoss values. The vertical lines in the scatter plots reduce as the ProsperScore increases but these are spread over a wide range of ProsperScore values. The slope of the linear plot increases as ProsperScore increases which is consistent with the increase in median of APR_by_Loss with increase in ProsperScore. I believe that ProsperScore certainly has a relation with BorrowerAPR and should be considered in creating a model.
Combining these results with those from LoanAmtFactor, I infer that loans with lower ProsperScore have loans of mainly low amounts, although, the causality cannot be established from this analysis.
## factor(loans.sub.2$CreditScoreRangeLower): 0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 133
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 360
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 420
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 5
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 440
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 36
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 460
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 141
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 480
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 346
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 500
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 554
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 520
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1593
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 540
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1474
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 560
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1357
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 580
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1125
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 600
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.7058 2.2604 2.4014 2.3829 2.6632 4.9298 2562
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 620
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.7915 2.2801 2.4297 2.4708 2.6951 4.9298 2519
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 640
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.594 2.283 2.582 2.658 2.863 8.243 3350
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 660
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.5911 2.3970 2.7208 2.8591 3.1324 8.2455 2233
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 680
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.7328 2.4627 2.8595 3.0275 3.4292 8.2435 2473
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 700
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.7967 2.6327 3.0842 3.2969 3.7525 10.5952 1861
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 720
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.7507 2.7988 3.3651 3.6103 4.0705 18.9950 1889
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 740
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.8066 2.9787 3.6345 3.9519 4.5809 35.6133 1396
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 760
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.126 3.141 3.907 4.258 4.970 18.728 1353
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 780
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.001 3.590 4.552 5.021 5.726 24.492 919
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 800
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.550 3.873 4.943 5.532 6.496 22.603 537
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 820
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.8745 4.1537 5.3813 5.9067 6.8280 17.0567 367
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 840
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.203 4.444 5.861 6.400 7.708 23.033 169
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 860
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.203 4.552 5.921 6.464 7.837 16.718 90
## --------------------------------------------------------
## factor(loans.sub.2$CreditScoreRangeLower): 880
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.724 5.761 6.828 6.725 7.710 13.084 10
The plots show that the vertical lines appear mainly for the lower values of the CreditScoreRangeLower. I also notice that if I were to create a linear model for BorrowerAPR using EstimatedLoss then I will get a different slope for each value of CreditScoreRangeLower. The slope of the line increases as the loan amount increases. Based on the above analysis, I think CreditScoreRangeLower does have a non-random relationship with BorrowerAPR and should be considered when creating a model to predict BorrowerAPR.
I notice that there are no data points for EstimatedLoss for years 2005 to 2008. This must be the loans that have N/A for EstimatedLoss due to expiry or other reasons that we found in the Univariate analysis.
I will create a new data set to remove these years. This will not have any adverse impact to the analysis as I need only data that has EstimatedLoss values. I will use the new data set going forward.
## [1] "Dimensions of the new data set: "
## [1] 84997 17
I will recreate the last plot with the new dataset.
The vertical lines on the BorrowerAPR vs EstimatedLoss scatterplots are present for years before (and including) 2010 only. This is an interesting finding as it implies that some relationships have changed at the end of 2010 to remove the vertical bands and provide an almost linear profile on these scatter plots.
##
## Call:
## lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss), data = subset(loans.sub.2.sub_year,
## !is.na(BorrowerAPR) & Year >= 2011))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.112996 -0.012286 -0.001311 0.011901 0.070112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0900705 0.0001321 681.6 <2e-16 ***
## I(EstimatedLoss) 1.7089812 0.0014547 1174.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01776 on 77296 degrees of freedom
## Multiple R-squared: 0.947, Adjusted R-squared: 0.947
## F-statistic: 1.38e+06 on 1 and 77296 DF, p-value: < 2.2e-16
On running the linear model for BorrowerAPR vs EstimatedLoss using the data from and including 2011, I find that EstimatedLoss explains 94.7% of the variation in BorrowerAPR. This is an improvement over the 90% R-squared value observed in bivariate analysis for EstimatedLoss that included the data across all the years.
I will now run some tests to see what further relationships exist when the plots are split by year.
We can see that 2014 only has values for 1st quarter which explains the relatively low number of loan originations for 2014 in our data set. For 2013, I see that for same EstimatedLoss value, the BorrowerAPR rate is higher in Q1 vs Q4.
Looking at 2013 plot more closely shows the above observation clearly. This can be due to change in Prosper’s loan pricing method or particular short term concentration in a particular type of loan or customer behaviour. I am unable to analyse this due to the lack of data.
We see a consistent behaviour across all years for ProsperScore.
The behaviour here is also consistent across the years.
In the years 2009 and 2010, most of the loans had a low value for LoanAmtFactor with the scatterplot showing data across the chart and no relationships can be observed. From 2011 onwards, we see that lower LoanAmtFactor tend to have higher BorrowerAPR and EstimatedLoss. Also during these years, the slopes of the linear models for BorrowerAPR vs EstimatedLoss, split by LoanAmtFactor, diverge only slightly. LoanAmtFactor A has lowest slope while E has highest slope.
In these plots, I find that for 2009 and 2010, most of the loans have Term of 36 and so not much to observe on patterns. From 2011 onwards, there seems to be small difference in BorrowerAPR for different Terms. This behaviour has changed with time. In 2012, for same EstimatedLoss, Term 60 loans had highest BorrowerAPR while Term 12 loans had lowest BorrowerAPR. In 2013, Q1 and Q2 show same trend but this changes during Q3 and finally in Q4, Term 36 loans have highest BorrowerAPR while Term 60 has lowest BorrowerAPR.
The plots in this section show that a model can be build to predict BorrowerAPR using EstimatedLoss, Year, ProsperScore, Term, Quarter and LoanAmtFactor. I will develop a linear model with these variables for a subset of loans that originated after, and including, 2011. Though, we have seen that for period before 2011, EstimatedLoss can predict 90% of the BorrowerAPR variation but it seems clear that the BorrowerAPR behaviour changed from 2011, which removed the vertical bands in the BorrowerAPR vs EstimatedLoss scatterplot. Hence, different models should be created to reflect the change in behaviour. Here I am developing model for 2011 onwards as that it is a more recent period.
##
## Calls:
## lm1: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss), data = loans.sub.2.sub_year.2011_plus)
## lm2: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss) + Year, data = loans.sub.2.sub_year.2011_plus)
## lm3: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss) + Year + ProsperScore,
## data = loans.sub.2.sub_year.2011_plus)
## lm4: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss) + Year + ProsperScore +
## Term, data = loans.sub.2.sub_year.2011_plus)
## lm5: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss) + Year + ProsperScore +
## Term + Quarter + LoanAmtFactor, data = loans.sub.2.sub_year.2011_plus)
##
## ====================================================================================================
## lm1 lm2 lm3 lm4 lm5
## ----------------------------------------------------------------------------------------------------
## (Intercept) 0.090*** 0.100*** 0.111*** 0.107*** 0.105***
## (0.000) (0.000) (0.001) (0.001) (0.001)
## I(EstimatedLoss) 1.709*** 1.667*** 1.554*** 1.554*** 1.568***
## (0.001) (0.001) (0.002) (0.002) (0.002)
## Year: 2012/2011 0.001** -0.000* -0.002*** -0.002***
## (0.000) (0.000) (0.000) (0.000)
## Year: 2013/2011 -0.008*** -0.012*** -0.014*** -0.013***
## (0.000) (0.000) (0.000) (0.000)
## Year: 2014/2011 -0.021*** -0.024*** -0.026*** -0.025***
## (0.000) (0.000) (0.000) (0.000)
## ProsperScore: 2 0.006*** 0.007*** 0.007***
## (0.001) (0.001) (0.001)
## ProsperScore: 3 0.005*** 0.005*** 0.005***
## (0.001) (0.001) (0.001)
## ProsperScore: 4 0.004*** 0.004*** 0.003***
## (0.001) (0.001) (0.001)
## ProsperScore: 5 0.006*** 0.005*** 0.005***
## (0.001) (0.001) (0.001)
## ProsperScore: 6 0.006*** 0.005*** 0.005***
## (0.001) (0.001) (0.001)
## ProsperScore: 7 0.004*** 0.003*** 0.003***
## (0.001) (0.001) (0.001)
## ProsperScore: 8 -0.005*** -0.006*** -0.006***
## (0.001) (0.001) (0.001)
## ProsperScore: 9 -0.013*** -0.014*** -0.014***
## (0.001) (0.001) (0.001)
## ProsperScore: 10 -0.024*** -0.024*** -0.024***
## (0.001) (0.001) (0.001)
## ProsperScore: 11 -0.020*** -0.020*** -0.018***
## (0.001) (0.001) (0.001)
## Term: 36/12 0.004*** 0.003***
## (0.000) (0.000)
## Term: 60/12 0.011*** 0.010***
## (0.000) (0.000)
## Quarter: Q2/Q1 0.003***
## (0.000)
## Quarter: Q3/Q1 0.003***
## (0.000)
## Quarter: Q4/Q1 -0.003***
## (0.000)
## LoanAmtFactor: B -0.001***
## (0.000)
## LoanAmtFactor: C 0.003***
## (0.000)
## LoanAmtFactor: D 0.001**
## (0.000)
## LoanAmtFactor: E 0.001*
## (0.000)
## ----------------------------------------------------------------------------------------------------
## R-squared 0.947 0.955 0.963 0.965 0.966
## adj. R-squared 0.947 0.955 0.963 0.965 0.966
## sigma 0.018 0.016 0.015 0.014 0.014
## F 1380124.496 409751.704 142535.688 132011.249 95361.037
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 201891.460 208212.173 215511.062 217628.613 219038.125
## Deviance 24.382 20.703 17.140 16.227 15.645
## AIC -403776.920 -416412.345 -430990.123 -435221.226 -438026.249
## BIC -403749.154 -416356.813 -430842.036 -435054.628 -437794.864
## N 77298 77298 77298 77298 77298
## ====================================================================================================
The variables in the linear model (lm5) account for 96.6% of the variation in BorrowerAPR. Even with just EstimatedLoss, the model accounts for 94.7% of the variation.
In this Section, I realised that the year of origination has a significant role is understanding the relationships. The BorrowerAPR behaviour with EstimatedLoss changes from 2011 onwards. Over time, some relationships such as Term of loan to BorrowerAPR have also changed. Inclusion of year of origination has improved the understanding of the relationship between the variables.
I found the change in behaviour of BorrowerAPR to EstimatedLoss around 2010 interesting. The contribution of Term to BorrowerAPR also changes over time. Both these features were unexpected and raise my curiousity over the rationale behind them.
Yes, I have created a linear model that can explain 96.6% of the variation in BorrowerAPR. I have been able to identify EstimatedLoss as the key variable that alone explains 94.7% of the variation. I have identified multiple variables that influence the relationship between BorrowerAPR and EstimatedLoss. The model has been developed for period from 2011 onwards and hence, should not be used for period before then. The relationships between the variables have changed over time and so it is possible that the model also have to be re-developed to account for the changes. Some relationships are not linear, such as that of the Year with other variables. Other models, beyond linear, should be explored to achieve greater accuracy.
The distribution of the borrow rates seems to have multiple peaks. The peak at borrow rates above 35% seems to stand out. This could be perhaps due to the concentration of loans with similar features, such as Term.
Loans with the highest amount have the lowest median borrow rate. Generally, loans with small amounts have a much higher median borrow rate. The density plot shows that the smallest size loans, symbolised by LoanAmtFactor of A, have most of its loan issued at high values of borrow rate. As we go from Loan Factor A to E, the loans have a much more even distribution across Borrower’s APR.
The vertical lines on the Borrower’s APR vs Estimated Loss scatterplots are present for years before (and including) 2010 only. This implies that some relationships have changed at the end of 2010 to remove the vertical bands and provide an almost linear profile on these scatter plots. The linear profile implies that a linear model can be constructed to predict the borrow rate using Estimated Loss. Looking at 2013 closely, the relationship of Term with the linear relationship can be seen changing as we go from Q1 to Q4. This shows that Term should also be used in developing the model but its contribution is non-linear as it varies with other variables, Year and Quarter.
The Prosper loans data set contains almost 114,000 loans with 81 variables. I started by creating 2 new variables and taking a subset of 17 variables to analyse. I investigated each variable to understand what they represent and their behaviour. I observed the relationships between the variables and more specifically with the borrow rate of the loans. During the process, I explored interesting questions and came across some unexpected results. Finally, I developed a linear model to predict the borrow rates for loans that originate from 2011 onwards.
There is a clear relationship between Estimated Loss and the borrow rate but there remained a fair amount of unexplained variations. As I explored other variables, I found some additional relationships to explain the variation in borrow rate. However, even with these variables, I struggled to explain the variation in borrow rate for same estimated loss that could be seen as vertical lines in BorrowerAPR and EstimatedLoss scatterplot. Earlier in my analysis, I had observed a sudden drop in loan origination in 2009 and then a consistent increase in origination over the years. I wondered if there is any changes in modelling or customer behaviour around 2009 that I should consider. I was very surprised to find that the unexplained variation, that is the vertical lines in the scatterplot were all coming from the years 2009 and 2010. This was unexpected and the data set is unable to provide a reason for this. I further explored the data using the variables I had identified earlier and analysed the relationships by year and quarter of loan origination. I was surprised to see change in Term’s relationship to the borrow rate over time and was able to see how it had gradually changed over the quarters in 2013. In the end, I have developed a linear model for loans orginating from 2011 onwards and can explain 96.6% variation in the borrow rate. I have used Estimated Loss, Year, Quarter, ProsperScore, LoanAmtFactor and Term to develop this model.
The linear model has some limitations. It has been developed for period from 2011 onwards and should not be used for earlier periods in the data set. The model may need to be re-developed if the relationship between variables changes as we have seen in the data set. Some relationships such as that of the Year with other variables are not linear. For future work, I would like to test non-Linear models to explain a greater percentage in borrow rate variation. I would also like to analyse the variables that were filtered out to find an explanation for the unexpected changes in the relationships observed in this analysis.